Seeping Semantics: Linking Datasets using Word Embeddings for Data Discovery

Authors

  • Raul Castro Fernandez
  • Essam Mansour
  • Abdulhakim A. Qahtan
  • Ahmed Elmagarmid
  • Ihab F. Ilyas
  • Samuel Madden
  • Mourad Ouzzani
  • Michael Stonebraker
  • Nan Tang

Abstract

Employees who spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes the lack of knowledge of the schemas, aggravate this problem. Similar to how we navigate the Web today, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other to facilitate navigating the schemas. They also relate data to external data sources, such as ontologies and dictionaries, to help explain the meaning of the schemas. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is finding pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is driven by a semantic matcher that leverages word embeddings to find objects that are semantically related. To leverage word embeddings, we introduce coherent groups, a novel technique for combining them that works better than other state-of-the-art alternatives for this problem. We implement SEMPROP as part of a discovery system we are building, and we conduct user studies, real deployments, and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups for finding those links.
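
The abstract does not define coherent groups, so the following is only an illustrative sketch of one way word embeddings could be combined to decide whether two schema names are semantically related. It assumes that a group of words counts as coherent when the average pairwise cosine similarity of their vectors reaches a threshold. The function names, the threshold value, and the toy three-dimensional vectors are all hypothetical and are not taken from SEMPROP.

```python
from itertools import combinations
from math import sqrt

# Toy embeddings, invented for illustration only (not real word vectors).
TOY_EMBEDDINGS = {
    "employee": (0.9, 0.1, 0.0),
    "staff": (0.8, 0.2, 0.1),
    "salary": (0.7, 0.3, 0.1),
    "protein": (0.0, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_coherence(words, embeddings):
    """Average pairwise cosine similarity over the words that have a vector."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if len(vectors) < 2:
        return 0.0  # Not enough known words to judge coherence.
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def semantically_related(name_a, name_b, embeddings, threshold=0.5):
    """Assumed rule: two names are linked if their words form a coherent group."""
    words = name_a.lower().split("_") + name_b.lower().split("_")
    return group_coherence(words, embeddings) >= threshold


related = semantically_related("employee_salary", "staff", TOY_EMBEDDINGS)
unrelated = semantically_related("employee", "protein", TOY_EMBEDDINGS)
```

With these toy vectors, the column name `employee_salary` and the table name `staff` fall into one coherent group, while `employee` and `protein` do not. In practice the vectors would come from a pretrained embedding model, and the threshold would need tuning on real schemas.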

Similar resources

Learning Word Embeddings from Tagging Data: A methodological comparison

The semantics hidden in natural language are an essential building block for a common language understanding needed in areas like NLP or the Semantic Web. Such information is hidden for example in lightweight knowledge representations such as tagging systems and folksonomies. While extracting relatedness from tagging systems shows promising results, the extracted information is often encoded in...

Estimating the Parameters for Linking Unstandardized References with the Matrix Comparator

This paper discusses recent research on methods for estimating configuration parameters for the Matrix Comparator used for linking unstandardized or heterogeneously standardized references. The matrix comparator computes the aggregate similarity between the tokens (words) in a pair of references. The two most critical parameters for the matrix comparator for obtaining the best linking results a...

Turkish entity discovery with word embeddings

Entity-linking systems link noun phrase mentions in a text to their corresponding knowledge base entities in order to enrich a text with metadata. Wikipedia is a popular and comprehensive knowledge base that is widely used in entity-linking systems. However, long-tail entities are not popular enough to have their own Wikipedia articles. Therefore, a knowledge base created by using Wikipedia ent...

Radical-Based Hierarchical Embeddings for Chinese Sentiment Analysis at Sentence Level

Text representation in Chinese sentiment analysis is usually working at word or character level. In this paper, we prove that radical-level processing could greatly improve sentiment classification performance. In particular, we propose two types of Chinese radical-based hierarchical embeddings. The embeddings incorporate not only semantics at radical and character level, but also sentiment inf...

Leveraging Distributional Semantics for Multi-Label Learning

We present a novel and scalable label embedding framework for large-scale multi-label learning a.k.a ExMLDS (Extreme Multi-Label Learning using Distributional Semantics). Our approach draws inspiration from ideas rooted in distributional semantics, specifically the Skip Gram Negative Sampling (SGNS) approach, widely used to learn word embeddings for natural language processing tasks. Learning s...

Publication date: 2018